A Framework for Characterizing Feature Weighting and Selection Methods in Text Classification
Authors
Abstract
Optimizing the performance of classification models often involves feature selection, either to eliminate noise from the feature set or to reduce computational complexity by controlling the dimensionality of the feature space. Refinement of the feature set is typically performed in two steps: scoring and ranking the features, then applying a selection criterion. Empirical studies of feature selection methods are usually limited to identifying the number or percentage of features to retain in order to maximize classification performance. Since no characterization of the feature set beyond its size is considered, we currently have a limited understanding of the relationship between classifier performance and the properties of the selected feature set. This paper presents a framework for characterizing feature weighting methods and selected feature sets, and for exploring how these characteristics account for the performance of a given classifier. We illustrate the use of two feature set statistics: the cumulative information gain of the ranked features and the sparsity of the data representation that results from the selected feature set. We apply a novel approach of synthesizing ranked lists of features that satisfy given cumulative information gain and sparsity constraints. We show how the use of synthesized rankings enables us to investigate the degree to which feature set properties explain the behaviour of a classifier, e.g., a Naïve Bayes classifier, when used in conjunction with different feature weighting schemes.
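The two-step refinement described above (score and rank features, then apply a selection criterion) and the two feature set statistics (cumulative information gain and sparsity) can be sketched in miniature as follows. This is an illustrative sketch, not the paper's implementation: the tiny corpus, the `information_gain` helper, and the top-k selection criterion are assumptions chosen to make the pipeline concrete.

```python
# Sketch of the two-step feature refinement: (1) score and rank features
# by information gain, (2) select the top-k. The corpus, helper names,
# and k are illustrative, not from the paper.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """IG of a binary feature (term presence) with respect to the class labels."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    h = entropy(labels)
    for part in (present, absent):
        if part:
            h -= len(part) / len(labels) * entropy(part)
    return h

# Toy corpus: each document is a set of terms, with a class label.
docs = [{"ball", "goal"}, {"ball", "vote"}, {"vote", "law"}, {"law", "goal"}]
labels = ["sport", "sport", "politics", "politics"]
vocab = sorted(set().union(*docs))

# Step 1: score and rank; Step 2: selection criterion (keep top-k).
ranked = sorted(vocab, key=lambda f: information_gain(docs, labels, f), reverse=True)
top_k = ranked[:2]

# Feature set statistics: cumulative IG of the selected ranking, and the
# sparsity (fraction of zero entries) of the reduced document-term matrix.
cum_ig = sum(information_gain(docs, labels, f) for f in top_k)
sparsity = sum(f not in d for d in docs for f in top_k) / (len(docs) * len(top_k))
```

The point of the framework is that two rankings retaining the same number of features can differ sharply in `cum_ig` and `sparsity`, which is what the synthesized rankings are designed to probe.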
Similar Resources
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large size of the feature space in TDC. TDC involves several steps, such as text processing, feature extraction, form...
A New Framework for Distributed Multivariate Feature Selection
Feature selection is an important issue in the classification domain. Selecting good features, via maximum relevance to the class label and minimum redundancy among features, improves classification accuracy. However, most current feature selection algorithms work only as centralized methods. In this paper, we propose a distributed version of the mRMR featu...
An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification
In recent years, the production of text documents has grown exponentially, which is why their proper classification is necessary for better access. One of the main problems in classifying text documents is working in a high-dimensional feature space. Feature Selection (FS) is one way to reduce the number of text attributes. Thus, working with a great bulk of the feature spa...
A Novel One Sided Feature Selection Method for Imbalanced Text Classification
Imbalanced data arises in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend toward the majority class and may even treat minority-class data as outliers. Text data is one of t...
Publication date: 2005